This site moved to dorotac.eu. The version you're seeing now is a preserved copy from April 2023.

# Word embedding fun

I'm terrible at naming things. When others come up with slick user names like "el_duderino" or "pseudolus", I struggle to come up with anything better than "blob123". Right now I'm busy setting up a new project which needs a name, and inspiration is nowhere to be found.

What if there was a machine to transform terrible ideas into good ones? Take a trope, and tweak it a little to get something original?

## Word vectors

I'm not going to pretend that I know exactly what the term "word embedding" means. What matters is that words in a human language can be associated with vectors in a linear space. With a little thinking, we can leverage that property to find words with similar meanings, or solve analogies like this one:

> What is to woman as king is to man?

Of course, it's "queen". It's the classic example, so please read this, because I couldn't explain it better than that article does.
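To make the arithmetic concrete, here's a toy sketch with made-up two-dimensional vectors. Real models learn hundreds of dimensions from huge corpora; the numbers below are purely illustrative and chosen by hand so that the analogy works out:

```
import math

# Made-up 2-dimensional "embeddings", for illustration only.
vectors = {
    "king":   (0.9, 0.9),
    "queen":  (0.9, 0.1),
    "man":    (0.1, 0.9),
    "woman":  (0.1, 0.1),
    "prince": (0.7, 0.5),
    "apple":  (-0.5, 0.3),
}

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# "What is to woman as king is to man?" -> king - man + woman
target = tuple(k - m + w for k, m, w in
               zip(vectors["king"], vectors["man"], vectors["woman"]))

# Pick the nearest remaining word by cosine similarity,
# skipping the three query words themselves.
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(vectors[w], target))
print(best)  # -> queen
```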

Okay, so what about the trope? Let's try with a political meme:

> What's the anarchist version of "fully automated luxury gay space communism"?

## Playing with words

To answer the question, I installed *gensim* from pip and fetched some ready-made data. Working interactively in Python 3, I started this way:

```
import gensim.downloader as api
corpus = api.load('text8')
from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus)
```

Now that I had a vector space with words, I started playing with it, translating the meme word by word. If "fully" is a "communism" term, what's its "anarchism" version?

```
>>> model.wv.similar_by_vector(-model.wv['communism'] + model.wv['anarchism'] + model.wv['fully'])
[('fully', 0.7874691486358643), ('correctable', 0.5339317321777344), ('strictly', 0.533452033996582), ('wholly', 0.5306737422943115), ('totally', 0.5066695213317871), ('properly', 0.4988959729671478), ('locally', 0.4917019009590149), ('entirely', 0.4908182919025421), ('statically', 0.48547613620758057), ('completely', 0.4838995933532715)]
```

Turns out it's still "fully". Going all the way, we get "fully introductory ornamental gay space humanism". That's a bit surprising! According to the model, "communist" is to "communism" as "anarchist" is to "humanism". I blame it on the model being relatively small.
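The word-by-word translation could be automated with a small helper. This is only a sketch: `translate` is my own hypothetical function, assuming a gensim-style object with `__getitem__` and `similar_by_vector`, like the `model.wv` above:

```
def translate(wv, phrase, source, target, skip=()):
    """Replace each word with the nearest neighbour of
    word - source + target in the embedding space."""
    out = []
    for word in phrase.split():
        shifted = [w - s + t for w, s, t in
                   zip(wv[word], wv[source], wv[target])]
        # similar_by_vector returns (word, similarity) pairs, best first;
        # skip the query word itself so we always get a fresh suggestion.
        for candidate, _ in wv.similar_by_vector(shifted):
            if candidate != word and candidate not in skip:
                out.append(candidate)
                break
    return " ".join(out)

# With the text8 model loaded earlier, something like:
# translate(model.wv, "fully automated luxury gay space communism",
#           "communism", "anarchism", skip=("communism", "anarchism"))
```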

## Readable encyclopedic excellent lesbian topological altruism

This is the best nugget I got out of the *text8* corpus. It seems to be giving a nice selection of words with similar meanings, although changing the political option doesn't seem to do much. Other examples of computer-generated silliness include "perfectly analytical aquarium bisexual seti individualist" and "fully extensible cheddar gay space anarchism". I'm going to keep "fully extensible cheddar" for a different project…

## Using a larger model

Seeing the obvious drawbacks in accuracy and the lack of sensitivity to political stances, I tried a serious model, based on Wikipedia. *Gensim* doesn't disappoint in that department either:

```
model = api.load("fasttext-wiki-news-subwords-300")
```

After coming back from the tea break, we are in a 300-dimensional model, containing the condensed knowledge of Wikipedia. Let's see how it fares:

```
>>> model.wv.similar_by_vector(-model.wv['communist'] + model.wv['anarchist'] + model.wv['communism'])
[('anarchism', 0.883374810218811), ('anarchist', 0.8050785064697266), ('Anarchism', 0.7585680484771729), ('anarcho-syndicalism', 0.7496938705444336), ('anarcho-communism', 0.7432658672332764), ('anarchists', 0.7426955699920654), ('anarcho-socialism', 0.7400932312011719), ('anarchistic', 0.7102389931678772), ('anarcho-capitalism', 0.7090942859649658), ('anarchisms', 0.7069689035415649)]
```

Good! It could derive that the "anarchism" equivalent of "communist" is "anarchist".

```
>>> model.wv.similar_by_vector(-model.wv['communism'] + model.wv['anarchism'] + model.wv['fully'])
[('fully', 0.8348307609558105), ('fullly', 0.6537811756134033), ('fuly', 0.6530433893203735), ('thoroughly', 0.6464489102363586), ('completely', 0.6422995328903198), ('properly', 0.6398820877075195), ('adequately', 0.6317397356033325), ('fully-', 0.6239138841629028), ('comprehensively', 0.6175973415374756), ('partially', 0.606654167175293)]
```

Uh, there's a bit of repetition there. What about other words?

```
>>> model.wv.similar_by_vector(-model.wv['communism'] + model.wv['anarchism'] + model.wv['space'])
[('space', 0.8099461793899536), ('sub-space', 0.6615338325500488), ('spaces', 0.6537838578224182), ('work-space', 0.6258285045623779), ('non-space', 0.6244523525238037), ('space--and', 0.6012811064720154), ('workspace', 0.6001818776130676), ('WPspace', 0.5913711786270142), ('space-', 0.5863816738128662), ('space--', 0.5790454745292664)]
```

Disappointing :( I had to dig through a lot of results to get anything not related to the original word. Here are some examples: "adequately anarchist high-end lesbian column-free libertarianism", "completely computerized boutique-style heterosexual hypersphere anti-capitalism".

## Different APIs

There's another API for creating analogies: the `most_similar` call. It works like this:

```
>>> model.most_similar(positive=['anarchism', 'space'], negative=['communism'])
[('sub-space', 0.6117153763771057), ('spaces', 0.5981306433677673), ('work-space', 0.5736098885536194), ('non-space', 0.5651520490646362), ('workspace', 0.5563109517097473), ('space--and', 0.542439341545105), ('WPspace', 0.5411180257797241), ('space-', 0.5368467569351196), ('meta-space', 0.5321565866470337), ('space--', 0.5256682634353638)]
```

The results are a little different: the query word "space" no longer shows up, so `most_similar` apparently excludes the input words, and the similarity scores differ too, perhaps because the vectors are combined or measured in a different way.
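One plausible explanation for the differing scores, sketched with made-up numbers: if `most_similar` normalizes each word vector to unit length before adding them, the combined direction changes whenever the inputs have different lengths. A minimal sketch of the effect, not a claim about gensim's actual internals:

```
import math

def unit(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

# Two made-up word vectors of very different lengths.
a = [3.0, 0.0]
b = [0.0, 1.0]

raw = [x + y for x, y in zip(a, b)]                  # raw sum
normed = [x + y for x, y in zip(unit(a), unit(b))]   # sum of unit vectors

print(unit(raw))     # leans heavily towards a
print(unit(normed))  # exactly between a and b
```

The raw sum is dominated by the longer vector, while the normalized sum gives both words equal weight, so the two approaches can rank neighbours differently.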

## Word salad

Using models to reveal some hidden properties of political options didn't work out, but the smaller model was a decent thesaurus. I suspect this is due to it having fewer words, creating more varied results near the best answer. I still don't know exactly how the two APIs differ. While I expected the similarity function to work based on Euclidean distance, it seems to be based on cosine similarity instead. Perhaps that's the reason the anchor word sometimes ends up in the results as well.

All in all, I think word vectors are going to end up being a nice toy, when some weirdness is needed for brainstorming. Let "aquarium" be an example: it's far out in terms of adjectives, but it is indeed not entirely unrelated to "luxury".
